Looking at case studies can be a good way to gain intuition about how to build Conv-nets. An architecture that works well on 1 task can often wor well on other tasks as well.

Classic Networks

LeNet-5

  • From 1998
  • Goal: Recognise Handwritten Grayscale images (ie MNIST dataset)
  • Was built in 1998 so it uses Average-pooling. These days we’d use Max-pooling.
  • No padding was used - wasn’t common at the time.
  • 60 000 parameters used - relatively small to what we’d use in modern day networks.
  • At the time ReLU wasn’t common - so sigmoid activation functions were used.

Full Architecture of the network:

LeNet-5 visualised

LeNet-5 visualised

  1. Starts with \(32 \times 32 \times 1\) Grayscale image
  2. Apply a convolution with filter-size = 5, n-channels = 6 and stride = 1. The representation is now \(28 \times 28 \times 6\).
  3. Apply Average-pooling with filter_size = 2 and stride = 2. Object size is now \(14 \times 14 \times 6\)
  4. Apply a convolution with filter-size = 5, n-channels = 16 and stride = 1. The representation is now \(10 \times 10 \times 16\).
  5. Apply Average-pooling with filter_size = 2 and stride = 2. Object size is now \(5 \times 5 \times 16\).
  6. Apply a fully-connected layer with 120 Neurons.
  7. Apply a fully-connected layer with 84 Neurons.
  8. Apply a softmax layer with 10 Neurons. (It was done slightly differently in the paper, but this is what we’d use in modern days).

Parameter Count by layer:

  1. 0
  2. CNN Layer 1: \(5 \times 5 \times 6= 150\)
  3. Pooling Layer 1: 0
  4. CNN Layer 2: \(5 \times 5 \times 16= 400\)
  5. Pooling Layer 2: 0
  6. Dense Layer 1: \(400 \times 120 = 48 000\)
  7. Dense Layer 2: \(120 \times 84 = 10 080\)
  8. Softmax Layer: \(120 \times 10 = 1 200\)

Total count: 59 830. Note how most of the parameters are used by the fully connected layers.

AlexNet

AlexNet visualised

AlexNet visualised

  • From 2012
  • 56 million parameters
  • Convolutional layers are same-filters (so with zero-padding).

Full Architecture of Network:

For convenience, I summarise each row as follows:

CNN (filter-width, stride, n-filters), then Max-Pooling (filter-width, stride), then Object dimension after pooling: (width, height, channels)

  1. Image representation: (227,227,3)
  2. (11, 4, 96), (3,2),(27,27,96)
  3. (5,1, 256), (3,2), (13,13,256)
  4. (3,1, 384), no pooling, (13,13,384)
  5. (3,1, 384), no pooling, (13,13,384)
  6. (3,1, 256), (3,2), (13,13,256)
  7. Unroll, (9216,1)
  8. Dense-Layer 1, (4096)
  9. Dense-Layer 2, (4096)
  10. Softmax layer with 1000 nodes (this would be based on the classification task at hand)

Number of Parameters: 1. \(11 \times 11 \times 96 = 11616\) 2. \(5 \times 5 \times 256 = 6400\) 3. \(3 \times 3 \times 384 = 3456\) 4. \(3 \times 3 \times 384 = 3456\) 5. \(3 \times 3 \times 256 = 2304\) 6. 0 7. \(9216 \times 4096 = 37 748 736\) 8. \(4096 \times 4096 = 16 777 216\) 9. \(4096 \times 1000 = 4 096 000\)

Total Count: 58 649 184 parameters.

VGG - 16 Network

VGG - 16 visualised

VGG - 16 visualised

  • 2015
  • Simpler network design.
  • All conv-layers are \(3 \times 3\) with stride = 1, same
  • All Max-pooling layers are \(2 \times 2\) with stride = 2
  • The number of channels in each conv-layer tends to double
  • 138 million parameters

Full Architecture:

  1. Image representation: \(224 \times 224 \times 3\)
  2. 2x conv-layers of with 64 channels each
  3. pooling
  4. 2x conv-layers of with 128 channels each
  5. pooling
  6. 3x conv-layers of with 256 channels each
  7. pooling
  8. 3x conv-layers of with 512 channels each
  9. pooling
  10. 3x conv-layers of with 512 channels each
  11. pooling
  12. Dense-Layer
  13. Dense Layer
  14. Softmax 1000

ResNets

Residual Blocks

5 Residual blocks - The activations are passed directly forward, skipping a layer

5 Residual blocks - The activations are passed directly forward, “skipping” a layer

in a residual block, \(a^{[l+2]} = g(z^{[l+2]} + a^{[l]})\) for every even \(l\). Empirically, with a plain feed-forward network, adding layers to a network eventually has a negative effect on your model’s accuracy on the training set. However, with residual blocks, adding layers gives a consistently decreasing training error on the training set. So having residual blocks allows you to train much deeper networks (hundreds of layers).

Why do ResNets work well?

A large network with layer l being very deep in

A large network with layer l being very deep in

It makes the Identity Function easy to learn!

Consider the case where you have a very large Neural Network of residual blocks as in the image above leading to a layers \(l\) and \(l+2\). imagine there is large weight decay, leading to tiny values of \(w^{[l+2]}\) and \(b^{[l+2]}\). Then,

\[ \begin{array} {rcl} a^{[l+2]} &=& g(z^{[l+2]} + a^{[l]}) \\ &=& g(w^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \\ &\approx& g(a^{[l]}) \\ &\approx& a^{[l]}. \\ \end{array} \]

The result above gives us the result that if gradients are vanishing, a layer will simply become an identity layer. This means adding 2 layers gives a network with the performance of a large network will be at least as good as a smaller network of the same architecture. This lets us add on layers without fear of worse performance on the training set. In highly parameterised plain networks, it becomes difficult to find parameters that learn the identity network.

What if the dimension of \(a^{[l+2]}\) and \(a^{[l]}\) don’t match?

Say \(a^{[l+2]}\) has 256 nodes and \(a^{[l]}\) only has 128. Then we define \(a^{[l+2]} = g(z^{[l+2]} + w_s a^{[l]})\) where \(w_s \in \mathbb{R}^{256 \times 128}\). This fixes the dimensions. As for \(w_s\), we can set it as a learnable parameter or hard-code it to implement zero-padding so that the extra nodes are just set to zero. No guideline is given as to which is better, but I like the elegance of zero-padding, but I guess it doesn’t work if the second layer is smaller in size.

ResNets on images.

Example ResNet for images. Note its a bunch of convolutional layers with an occasional pooling layer.

Example ResNet for images. Note its a bunch of convolutional layers with an occasional pooling layer.

  • From 2015
  • Uses same convolutions, so layers are always the same dimension (until pooling)
  • Follows a bunch of identical convolutions, then pools, then repeats.

Inception Network

Consists of a series of Inception Layers. We will first describe a \(1 \times 1\) convolution, then go on to describe Inception Layers

\(1 \times 1\) Convolutions (Network in Network)

1 \times 1 Convolutional layer

\(1 \times 1\) Convolutional layer

From 2013. Although confusingly named, this type of layer basically reduces the number of channels of a layer while keeping the dimensions of each channel the same. Also remember, that even if we kept the dimensions the same, using a conv-layer has an activation function (typically ReLU), so we would still be adding some sort of nonlinearity to the network.

Inception Layer

Inception-layer - note 1-by-1 conv used for dimension reduction. Also note for Pooling layer, the 1-by-1 conv is implemented afterwards

Inception-layer - note 1-by-1 conv used for dimension reduction. Also note for Pooling layer, the 1-by-1 conv is implemented afterwards

  • From 2014
  • The idea is to not have to worry about choosing between filter hyperparameters. Just use them all!
  • Leads to enormous calculation cost - not parameters, just the sheer number of multiplications needing to be done
  • As a result of the above, 1-by-1 convolutions are used as a dimensionality-reduction step, often reducing the calculation burden 10-fold.
Inception-Network - Consists of a bunch of inception layers.

Inception-Network - Consists of a bunch of inception layers.